Optimizing Shuffle Performance in Spark

نویسندگان

  • Aaron Davidson
  • Andrew Or
چکیده

Spark [6] is a cluster framework that performs in-memory computing, with the goal of outperforming disk-based engines like Hadoop [2]. As with other distributed data processing platforms, it is common to collect data in a manyto-many fashion, a stage traditionally known as the shuffle phase. In Spark, many sources of inefficiency exist in the shuffle phase that, once addressed, potentially promise vast performance improvements. In this paper, we identify the bottlenecks in the execution of the current design, and propose alternatives that solve the observed problems. We evaluate our results in terms of application level throughput.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Scaling Spark on Lustre

We report our experiences in porting and tuning the Apache Spark data analytics framework on the Cray XC30 (Edison) and XC40 (Cori) systems, installed at NERSC. We find that design decisions made in the development of Spark are based on the assumption that Spark is constrained primarily by network latency, and that disk I/O is comparatively cheap. These assumptions are not valid on Edison or Co...

متن کامل

Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics

MapReduce and Spark are two very popular open source cluster computing frameworks for large scale data analytics. These frameworks hide the complexity of task parallelism and fault-tolerance, by exposing a simple programming API to users. In this paper, we evaluate the major architectural components in MapReduce and Spark frameworks including: shuffle, execution model, and caching, by using a s...

متن کامل

An Effective High-Performance Multiway Spatial Join Algorithm with Spark

Multiway spatial join plays an important role in GIS (Geographic Information Systems) and their applications. With the increase in spatial data volumes, the performance of multiway spatial join has encountered a computation bottleneck in the context of big data. Parallel or distributed computing platforms, such as MapReduce and Spark, are promising for resolving the intensive computing issue. P...

متن کامل

MapReduce with communication overlap (MaRCO)

MapReduce is a programming model from Google for cluster-based computing in domains such as search engines, machine learning, and data mining. MapReduce provides automatic data management and fault tolerance to improve programmability of clusters. MapReduce’s execution model includes an all-map-to-all-reduce communication, called the shuffle, across the network bisection. Some MapReductions mov...

متن کامل

A Workload-Specific Memory Capacity Configuration Approach for In-Memory Data Analytic Platforms

Nowadays, in-memory data analytic platforms, such as Spark, are widely adopted in big data processing. The proper memory capacity configuration has been proved to be an efficient way to guarantee the workload performance in such platforms.Currently, Spark adopts the static way to configure the memory capacity for workloads based on user specifications. However, due to the lack of deep knowledge...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2013